1 Abstract

Adelaide’s population is increasing, resulting in higher road congestion and its associated costs. As the residents of Adelaide become more reliant on public transport, it is critical to quantify the resilience of the public bus network to this rise in road congestion. The analysis contained in this report seeks to provide the South Australia Department for Infrastructure and Transport (DIT) with such an estimate by examining the relationship between the travel times of motor vehicles and bus trips, especially during morning and evening rush hours. The core objective of the analysis is to determine, for a specified road segment, the extent to which a deviation of motor vehicle travel times from what is normal for the same time frame is matched by a corresponding deviation in bus travel times, where a lower correlation indicates a bus network that is more resilient to increasing road congestion. Bus travel times are calculated from trip updates data obtained from the open source General Transit Feed Specification Realtime (GTFSR), as the time taken between the first and last stops of each trip on the segment, while motor vehicle travel times are calculated using link metrics available through DIT’s Addinsight data lake. After a descriptive analysis of the relationship between the travel times, the travel times were standardized relative to the time of day to better examine the response of bus travel times to variations in motor vehicle travel times. The analysis was performed on South Road for March 2022, for both the northbound direction towards the city and the southbound direction away from the city. The analysis shows that the evening southbound correlation between the standardized travel times is strong; however, the variations are relatively small, so it is difficult to infer the robustness of bus transportation to congestion, as larger variations would have to be observed and examined. It also finds that the morning standardized travel times towards the city are more varied and the correlation is lower, implying that bus transportation is relatively more robust to congestion than evening bus travel from the city.


2 Background

Adelaide’s population increased from 1.1 million to 1.3 million residents between 2006 and 2016, with 66 million more kilometers traveled on the road network during that time. Infrastructure Australia paints a dire picture of the level of road congestion in Adelaide and its continued worsening in the coming years, in line with both an increasing population and an increasing reliance on public transport in comparison to cars. Its report estimated the annualized cost of road congestion for Greater Adelaide to be approximately $1.4 billion in 2016, projected to rise to $2.6 billion in 2031 (Infrastructure Australia, 2019).

With this backdrop in mind, the client - DIT - has in its possession an untapped wealth of data relating to traffic information collected through Bluetooth probes, which take count of passing motor vehicles in a particular time and location, therefore producing a metric for road congestion.

This data will be examined in conjunction with publicly available, historical real time bus trip updates collected by GTFSR, which provide the predicted arrival time for each stop on a bus’s trip. The analysis aims to identify the relationship and robustness of bus travel times to road congestion on road segments of interest, especially during peak times.


3 Objectives

The aim of the proposed analysis is to investigate the extent of the relationship between bus travel times and road congestion - as measured by motor vehicle travel times - on identified road segments, where a strong relationship indicates a road segment where the bus travel times are less robust to congestion.

Initially, the bus performance metric to be used and applied was the average delay experienced by a bus trip on the segment of interest, as measured by a stop’s predicted arrival time versus the scheduled arrival time. However, the metric was later revised to the bus travel time between the first and last stops of a segment, removing the possibility that we are measuring how accurately the schedule predicts and/or buffers for congestion.

A proposal outlining the analysis, the objectives, and the methodology was created and sent to the client (found here). This was followed by a discussion with the client to provide more information regarding the analysis and to clarify any points they raised. An agreement was reached for the analysis to fulfill the following objectives:

  1. Detailed travel time or congestion analysis comparing public transport response to road traffic on selected sections of road over a given period of time, especially during peak hours

  2. Repeatable methodology, code, functions, and visuals that produce detailed analysis on other segments of interest

In fulfilling the first objective, the segment of road analysed is South Road in Adelaide. The period of time chosen is March 2022.

Regarding the second objective, the methodology and the code created aim to ensure as little manual input and edits as possible when applied to different road segments.

The analysis undertaken in this report will form the basis of future analysis into:

  • additional road segments of interest to generate a ranking of bus network robustness which can help inform the allocation of resources

  • the rate of change of bus resilience to congestion by examining prior periods of the same road segment

  • the impact of road works on bus performance by conducting and comparing the results of both pre- and post-works analysis

  • identifying the factors that can affect bus travel times such as the use of bus lanes, number of bus stops, traffic lights, etc.

  • creating a model predicting bus travel times using identified features.


4 Data Sources, Description, and Wrangling

Three main data sources are used: DIT Addinsight, General Transit Feed Specification (GTFS), and GTFSR. These data sources and their associated sub-sources will be outlined below. All the data is stored in the cloud using Amazon Web Services (AWS), and is accessed and retrieved through Athena which uses regular SQL syntax.

The data cleaning and wrangling will be discussed as it was the part of the analysis that required the highest workload.

As mentioned above, the methodology will be illustrated on South Road, which is one of Adelaide’s most important and major roads, and regularly suffers from congestion (Infrastructure Australia, 2019).


Figure 4.1: South Rd on map. Source: Google Maps


4.1 GTFS

This is a common format developed by Google and used by public transport agencies around the world and contains static or scheduled information about public transport services such as routes, stops, schedule and geographic transit information. For the purposes of this analysis, only the bus routes and bus stops datasets will be used.

4.1.1 Routes

These are the bus routes that go through South Road. The routes were identified by overlaying all the network routes on a map in Tableau and the routes on South Road were manually highlighted and exported to a list. The dataset simply contains the unique collection of route_ids on the segment.

4.1.2 Stops

The list of bus stops on the segment was identified using Tableau in the same fashion as the routes.

A dataset containing all the stops in the bus network, along with information relating to each stop, is used and filtered to only the stops present on the segment. For the purposes of this report, the file containing all the stops on the network was pre-filtered to the stops on the segment to accommodate GitHub file size limits. However, the code and methodology contained here apply as if the complete dataset were used and filtered through the code.

Table 4.1: Stops data description
Variable    Description
stop_id     Unique stop identifier
stop_name   Name of the location. Uses a name that people will understand
stop_desc   Address of the stop
stop_lat    Latitude of the stop
stop_lon    Longitude of the stop
direction   Road direction of the stop

The direction variable is manually created. In this case, if the stop is on the east side of South Road, then it is southbound (SB) away from the city; if the stop is on the west side of South Road, then it is northbound (NB) towards the city.

The bus stops will be plotted on a map to confirm they are all, in fact, on South Road.


Figure 4.2: Bus stops on South Road


4.2 GTFSR - Trip Updates

Unlike GTFS which provides static information, GTFSR provides real time information consisting of two types. The first type is a trip’s real time updates regarding a bus stop’s expected arrival times and delays. The second type is a real time update of a bus’s geographic position and speed at a specific point in time. This analysis uses the former only.

Once the bus routes that go through the segment were identified as outlined above, the real time updates for all the trips in March 2022 according to the routes were retrieved from the AWS database using Athena. This dataset is used to derive the bus travel time through the segment, which is the first element in the relationship being assessed in this analysis, with the other being the vehicle travel time as a measure of congestion.

First, the unedited data will be described.

Table 4.2: Unedited updates data description
Variable        Description
route_id        Unique route identifier
start_date      Start date of the trip
vehicle_id      Unique vehicle identifier
timestamp       Timestamp of the real time update
trip_id         Unique trip identifier
stop_sequence   Order of stops for a particular trip
stop_id         Unique stop identifier
delay           The current schedule deviation for the trip. The delay (in seconds) can be positive (meaning that the vehicle is late) or negative (meaning that the vehicle is ahead of schedule)
arrival_time    Predicted arrival time for a stop on a particular trip

It is important to note the following:

  • One route_id can have many trip_ids

  • One trip_id occurs at most once a day, but the same trip_id can occur on multiple days

As a bus trip is occurring, the real time predictions of the arrival times at the upcoming stops on the trip are updated at certain time intervals.

Cleaning and wrangling this dataset proved to be the most challenging and time consuming section of this analysis, with many methodologies, cleaning iterations, and code trialed to arrive at the optimal treatment. This is due to the complex relationships between the observations in the dataset, and the variety of errors and inconsistencies encountered.

The following preliminary adjustments were made:

  1. As each stop on a given trip can have multiple arrival time predictions, one for each update timestamp prior to reaching that stop, the SQL query ensures that each stop only has the predicted arrival time corresponding to the latest update timestamp, given that the later the prediction, the more accurate it is

  2. As a trip can begin and end outside the bounds of the segment, the updates were constrained only to those stops within the segment, in either direction

  3. Weekends and holidays were removed as we are interested in the relationships during working days only
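The three adjustments above can be sketched in Python as follows. This is an illustrative sketch only: the report performs the first step in the SQL query, and the function names, the dictionary-based row layout, the integer timestamps, and the sample holiday date are all assumptions made for the example.

```python
from datetime import date

def latest_prediction_per_stop(updates):
    """Adjustment 1: keep, per day, trip and stop, the row with the latest update timestamp."""
    latest = {}
    for row in updates:
        key = (row["start_date"], row["trip_id"], row["stop_id"])
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())

def within_segment(updates, segment_stop_ids):
    """Adjustment 2: keep only updates for stops that lie on the segment."""
    return [row for row in updates if row["stop_id"] in segment_stop_ids]

def is_working_day(day, holidays):
    """Adjustment 3: Monday to Friday (weekday() 0-4) and not a listed holiday."""
    return day.weekday() < 5 and day not in holidays

# Hypothetical sample rows; timestamps and arrival times are plain integers here.
updates = [
    {"start_date": date(2022, 3, 1), "trip_id": "T1", "stop_id": "S1", "timestamp": 100, "arrival_time": 500},
    {"start_date": date(2022, 3, 1), "trip_id": "T1", "stop_id": "S1", "timestamp": 160, "arrival_time": 520},
    {"start_date": date(2022, 3, 1), "trip_id": "T1", "stop_id": "S9", "timestamp": 160, "arrival_time": 640},
    {"start_date": date(2022, 3, 5), "trip_id": "T1", "stop_id": "S1", "timestamp": 160, "arrival_time": 700},
]
holidays = {date(2022, 3, 14)}  # sample value standing in for the holidays dataset

cleaned = [
    row
    for row in within_segment(latest_prediction_per_stop(updates), {"S1", "S2"})
    if is_working_day(row["start_date"], holidays)
]
```

In this sample, only the later prediction for stop S1 on the Tuesday survives: the earlier prediction is superseded, stop S9 is outside the segment, and 2022-03-05 is a Saturday.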

A new variable to_stop_time was created. This variable measures the time taken to reach each stop from the prior stop in seconds, within each trip. The variable was created to facilitate a potential deeper understanding of the data, to highlight any errors, and for potential utilities in the future such as drilling down to examine the patterns on a stop-basis.
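A minimal sketch of how such a variable could be derived is given below. It assumes arrival times are expressed in seconds (for example, epoch seconds) and that the rows passed in belong to a single trip on a single day; the function name and row layout are hypothetical.

```python
def add_to_stop_time(trip_rows):
    """Return the rows of one trip ordered by stop_sequence, each with a
    to_stop_time: seconds taken to reach the stop from the prior stop.
    The first stop has no prior stop, so its value is None."""
    ordered = sorted(trip_rows, key=lambda row: row["stop_sequence"])
    result = []
    previous_arrival = None
    for row in ordered:
        if previous_arrival is None:
            to_stop = None
        else:
            to_stop = row["arrival_time"] - previous_arrival
        result.append({**row, "to_stop_time": to_stop})
        previous_arrival = row["arrival_time"]
    return result

# Hypothetical trip in which the third stop is predicted to be reached
# before the second, which yields a negative to_stop_time.
trip = [
    {"stop_sequence": 2, "stop_id": "S2", "arrival_time": 1090},
    {"stop_sequence": 1, "stop_id": "S1", "arrival_time": 1000},
    {"stop_sequence": 3, "stop_id": "S3", "arrival_time": 1060},
]
with_times = add_to_stop_time(trip)
to_stop_times = [row["to_stop_time"] for row in with_times]
```

A negative value, as for the third stop in this sample, is the kind of inconsistency the variable is intended to expose.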

Through this variable, a range of errors were discovered that needed to be amended. This is how the data appears before any remedial actions are taken.


Figure 4.3: Unedited to-stop times contain negative values


Figure 4.3 shows that to_stop_time contains negative values, to the left of the red line. This is a clear error, as it is not possible for the time taken to reach a stop to be negative. Additionally, we can see clusters of very high delay values above 70 minutes.

In total, there were eight types of errors identified in the data. The list of errors, an example of each error, and the code to rectify the errors can be found in appendix 7.1.

Great effort was put into identifying each type of error and remedying it in a way that neither produces further errors nor removes large amounts of data. Identifying the correct order in which to tackle the types of errors was also essential, and formulating the code to fix each error required several iterations of trial and error. This was all done to ensure the errors were removed as surgically as possible, both to minimize data loss and because of the sensitive nature of the relationships between the stops on each trip.

The percentage of error entries located and fixed in the data was 3.83%. The cleaned data now appears as follows:


Figure 4.4: Cleaned to-stop times do not contain negative values


With the data now cleaned, two additional variables, first_stop and last_stop, were created to identify the first and last stops of each trip within the segment. The total time per trip can now be derived as the time between the first stop and last stop of the trip within the segment. The arrival times at the first and last stops on the segment are regarded as the start and end times of the trip, respectively. The distribution of the trip times per direction is shown below, using the two most frequently occurring first-last stop pairs per direction.


Figure 4.5: Different stops pairs in the same direction have different trip times
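The derivation of first_stop, last_stop, and the trip time described above can be sketched as follows. The function name, the output field names other than first_stop and last_stop, and the use of arrival times in seconds are assumptions for illustration.

```python
def trip_summary(trip_rows):
    """Summarise one trip within the segment: its first and last stops
    (by stop_sequence) and the time between their arrival times."""
    ordered = sorted(trip_rows, key=lambda row: row["stop_sequence"])
    first, last = ordered[0], ordered[-1]
    return {
        "first_stop": first["stop_id"],
        "last_stop": last["stop_id"],
        "start_time": first["arrival_time"],  # regarded as the trip start
        "end_time": last["arrival_time"],     # regarded as the trip end
        "trip_time": last["arrival_time"] - first["arrival_time"],
    }

# Hypothetical cleaned trip with arrival times in seconds.
trip = [
    {"stop_sequence": 1, "stop_id": "S1", "arrival_time": 1000},
    {"stop_sequence": 2, "stop_id": "S2", "arrival_time": 1090},
    {"stop_sequence": 3, "stop_id": "S3", "arrival_time": 1300},
]
summary = trip_summary(trip)
```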


As figure 4.5 shows, different first-last stop pairs in the same direction have different travel times. This means that trips can have different travel times solely because of their respective first and last stops on the segment, which makes their travel times incomparable as they cover different distances. Therefore, only trips with the same pair of first and last stops within the segment are kept, and the remaining trips are discarded. There can be only one pair of first and last stops per direction, so that the distance is constant for all the trips and the times are therefore comparable.

This pair of stops is identified as the most frequently occurring pair per direction. Now, only trips with this pair of first and last stops are kept in the data. The stop pair per direction can be seen in the map below:


Figure 4.6: Most occurring pair of bus stops per direction
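Selecting the most frequently occurring pair per direction and discarding all other trips can be sketched as below. The function name and row layout are hypothetical, and the tie-breaking behaviour (the pair encountered first wins) is an artefact of this sketch rather than something specified in the analysis.

```python
from collections import Counter

def keep_most_common_pair(trips):
    """Keep only trips whose (first_stop, last_stop) pair is the most
    frequent pair for their direction. Ties go to the pair seen first."""
    counts = {}
    for trip in trips:
        pair = (trip["first_stop"], trip["last_stop"])
        counts.setdefault(trip["direction"], Counter())[pair] += 1
    best = {direction: counter.most_common(1)[0][0] for direction, counter in counts.items()}
    kept = [
        trip for trip in trips
        if (trip["first_stop"], trip["last_stop"]) == best[trip["direction"]]
    ]
    return kept, best

# Hypothetical trips: the northbound pair (A, D) occurs twice, (B, D) once.
trips = [
    {"direction": "NB", "first_stop": "A", "last_stop": "D", "trip_time": 600},
    {"direction": "NB", "first_stop": "B", "last_stop": "D", "trip_time": 420},
    {"direction": "NB", "first_stop": "A", "last_stop": "D", "trip_time": 660},
    {"direction": "SB", "first_stop": "D", "last_stop": "A", "trip_time": 630},
]
kept, best = keep_most_common_pair(trips)
```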


The distribution of the trip times per direction is shown below:


Figure 4.7: Excessively large trip times exist, especially southbound


As figure 4.7 shows, excessive trip times occur. It is difficult to determine whether these are errors or genuine trip times without using further information. A variable called delay_diff is created which calculates the size of the difference between the delay of the first stop and the delay of the last stop per trip. Excessive values of this variable indicate the large travel time is due to an error as either of the stops has an artificially large delay or early arrival. A histogram of delay_diff is shown below:


Figure 4.8: Size of difference between first and last stop delays


Based on figure 4.8, trips with a delay_diff greater than 10 minutes were removed as they were most likely errors. The resulting data now appears as follows:


Figure 4.9: Excessively large trip times no longer exist
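The delay_diff filter can be sketched as follows. The field names first_delay and last_delay and the function names are hypothetical; the 10 minute threshold is the one stated above, expressed in seconds to match the delay variable.

```python
MAX_DELAY_DIFF_SECONDS = 10 * 60  # threshold of 10 minutes used in the analysis

def delay_diff(trip):
    """Size of the difference between the delays (in seconds) at the
    first and last stops of the trip within the segment."""
    return abs(trip["last_delay"] - trip["first_delay"])

def remove_suspect_trips(trips, threshold=MAX_DELAY_DIFF_SECONDS):
    """Drop trips whose delay_diff exceeds the threshold."""
    return [trip for trip in trips if delay_diff(trip) <= threshold]

# Hypothetical trips: the second has an implausible jump in delay.
trips = [
    {"trip_id": "T1", "first_delay": 30, "last_delay": 90},
    {"trip_id": "T2", "first_delay": 20, "last_delay": 1500},
    {"trip_id": "T3", "first_delay": -60, "last_delay": 120},
]
kept = remove_suspect_trips(trips)
```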


The bus travel times are split into five minute time frames, based on the arrival time at the first stop of the trip. For example, all bus trips that start between 2022-03-01 12:00:00 and 2022-03-01 12:05:00 are included in the same time frame. Since each time frame can contain multiple trips, the bus travel times within a frame are averaged into a single bus travel time. This establishes a one-to-one relationship with the vehicle travel times, which are also in five minute intervals. The final dataset looks as follows:

Table 4.3: Aggregated bus trip travel times data description
Variable       Description
day            Date of measurement
time           Time of the day in hour:minute:seconds of the measurement
hour           The hour of the measurement
rush           Whether the measurement occurs during rush hour. Morning rush hour occurs between 6:30am and 10am, evening rush hour occurs between 3:30pm and 7pm, neither otherwise
direction      The direction of travel
number_buses   The original number of trips during the five minute interval
bus_time       The bus trip travel time across the segment
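The aggregation into five minute time frames and the rush label can be sketched as follows. The function names, the rush labels ("morning", "evening", "neither"), and the treatment of each interval as including its start and excluding its end are assumptions for illustration; the output fields mirror Table 4.3.

```python
from datetime import datetime, time
from statistics import mean

def classify_rush(t):
    """Label a time of day using the rush hour windows in Table 4.3."""
    if time(6, 30) <= t < time(10, 0):
        return "morning"
    if time(15, 30) <= t < time(19, 0):
        return "evening"
    return "neither"

def aggregate_bus_times(trips):
    """Average trip times per direction and five minute time frame,
    where the frame is set by the trip's start time."""
    groups = {}
    for trip in trips:
        start = trip["start_time"]
        frame = start.replace(minute=start.minute - start.minute % 5, second=0, microsecond=0)
        groups.setdefault((trip["direction"], frame), []).append(trip["trip_time"])
    rows = []
    for direction, frame in sorted(groups):
        times = groups[(direction, frame)]
        rows.append({
            "day": frame.date(),
            "time": frame.time(),
            "hour": frame.hour,
            "rush": classify_rush(frame.time()),
            "direction": direction,
            "number_buses": len(times),
            "bus_time": mean(times),
        })
    return rows

# Hypothetical trips: two fall in the 07:00 frame, one in the 12:00 frame.
trips = [
    {"direction": "NB", "start_time": datetime(2022, 3, 1, 7, 1, 10), "trip_time": 600},
    {"direction": "NB", "start_time": datetime(2022, 3, 1, 7, 4, 50), "trip_time": 660},
    {"direction": "NB", "start_time": datetime(2022, 3, 1, 12, 0, 0), "trip_time": 500},
]
rows = aggregate_bus_times(trips)
```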


4.3 DIT Addinsight

This is traffic information collected by DIT Addinsight using Bluetooth devices that tag a Bluetooth-equipped vehicle when it comes into range. The location of a Bluetooth device is called a site, and a link is a segment of road between two sites, an origin site and a destination site. This allows for the calculation of metrics such as the time taken to travel through the link, among others.

The DIT Addinsight database is very large and contains many tables, each recording its own set of information, with foreign keys connecting most tables. This analysis uses only a subset of the tables in the database, presented below.


4.3.1 Holidays

This dataset contains holiday dates, which is the only variable used.


5 Analysis

5.1 Travel Times Comparison

The travel times from both sets will be compared against one another. This is done to gain a general understanding of the relationship as well as to validate the datasets, as we would expect to observe a similar pattern between both travel times. The comparison will be done through a series of graphs.

Figure 5.1: Vehicles are faster in both directions. Distributions of both types resemble each other


From figure 5.1 we learn the following:

  • For northbound travel to the city, vehicle travel time largely remains the same during both periods of rush hour, while bus travel time actually increases in the evening, a surprising result

  • For southbound travel away from the city, both travel times in the evening increase as expected and are more varied than the travel times in the morning

  • During the morning rush hour, vehicle travel times are longer towards the city, while bus travel times are roughly the same in both directions

  • During the evening rush hour, both travel types take longer away from the city, as expected

  • Bus travel times are generally longer than vehicle travel times, as expected, and both types exhibit similar patterns overall


Figure 5.2: Northbound bus travel times are slower than vehicle travel times. Both exhibit similar patterns


Figure 5.3: Southbound bus travel times are slower than vehicle travel times. Both exhibit similar patterns


Figures 5.2 and 5.3 both show that the travel times of the two types generally follow a similar pattern, which supports the validity of both datasets, as we would not expect to see very different patterns. Buses are almost always slower than vehicles, as buses need to load and unload passengers at the various bus stops along the road and, being heavy vehicles, accelerate more slowly. We also notice that towards the end of the day, in both directions, the travel times seem to level off at a low value. This is likely due to less traffic being present on the road in the evening, leading to faster traversal of the segment, with only constant factors such as the speed limit and traffic lights affecting the travel time. Note that the y-axes of the two figures are on independent scales.

To gain a clearer picture of the patterns and relationship throughout an average day, the travel times within 30 minute aggregates of the same time frame will be averaged across all the days. For example, for each travel type, all the measures occurring between 12:00 and 12:30 across all the days will be averaged, then plotted.
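The averaging of each 30 minute slot across all days can be sketched as below. The function name and row layout are hypothetical, and the sketch assumes each row carries its time of day as a datetime.time value.

```python
from datetime import time
from statistics import mean

def average_by_half_hour(rows, value):
    """Average the given field per direction and 30 minute slot of the day,
    pooling the measurements from every day in the period."""
    groups = {}
    for row in rows:
        t = row["time"]
        slot = t.replace(minute=0 if t.minute < 30 else 30, second=0, microsecond=0)
        groups.setdefault((row["direction"], slot), []).append(row[value])
    return {key: mean(values) for key, values in groups.items()}

# Hypothetical measurements on two different days.
rows = [
    {"day": "2022-03-01", "time": time(12, 5), "direction": "NB", "bus_time": 600},
    {"day": "2022-03-02", "time": time(12, 25), "direction": "NB", "bus_time": 660},
    {"day": "2022-03-02", "time": time(12, 40), "direction": "NB", "bus_time": 500},
]
averages = average_by_half_hour(rows, "bus_time")
```

The same function can be applied to the vehicle travel times by passing the name of the corresponding field.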


Figure 5.4: Average travel time patterns by vehicle and direction. Peak times are highlighted


Figure 5.4 shows the average pattern of travel times across the day, by direction and type. The morning and evening rush hours have been highlighted as they are the parts of the day of interest. For northbound travel towards the city, both rush hour periods display a similar level of travel time for both types, and the travel times in the rush hours are not much greater than those outside rush hours. This is an unexpected result, as travel times northbound towards the city would be expected to be higher in the morning rush hour. Southbound travel away from the city, however, follows expectations, as the travel time for both types increases dramatically in the evening rush hour as workers leave the city.

A scatter plot of the travel times will be examined:


Figure 5.5: A positive relationship exists between both travel times for both directions during peak times


Figure 5.5 shows that a positive relationship exists between the travel times, more so for the southbound direction in the evening.

The correlation figures between the travel times are:

Table 5.1: Correlation between travel times per direction per rush hour
Rush      Direction   Correlation
Morning   NB          0.79
Morning   SB          0.49
Evening   NB          0.67
Evening   SB          0.87

Table 5.1 shows that the travel times between buses and vehicles are highly correlated in the morning northbound towards the city, and in the evening southbound away from the city.
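A correlation per rush hour period and direction can be computed along the following lines. The report does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, as are the function names, the row layout, and the sample values (which are made up and unrelated to Table 5.1).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Raises ZeroDivisionError if either sequence has zero variance."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_by_group(rows):
    """Correlate vehicle and bus travel times within each (rush, direction) group."""
    groups = {}
    for row in rows:
        xs, ys = groups.setdefault((row["rush"], row["direction"]), ([], []))
        xs.append(row["vehicle_time"])
        ys.append(row["bus_time"])
    return {key: pearson(xs, ys) for key, (xs, ys) in groups.items()}

# Made-up, perfectly linear sample for one group.
rows = [
    {"rush": "morning", "direction": "NB", "vehicle_time": 300, "bus_time": 600},
    {"rush": "morning", "direction": "NB", "vehicle_time": 360, "bus_time": 700},
    {"rush": "morning", "direction": "NB", "vehicle_time": 420, "bus_time": 800},
]
correlations = correlation_by_group(rows)
```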


5.2 Travel Times Variation Analysis

The goal of the analysis is to ascertain the extent of the relationship between the variations in the motor vehicle travel times and the variations in the bus trip travel times, where variation is measured in reference to travel times during the same time frame across the entire period. In other words, if the vehicle travel time varies by a certain level relative to the usual travel time during the same time frame, can we observe a reflection of this variation in the bus travel time? If so, by how much?

In order to assess the variation, the travel times will be standardized. The function standardiser is created which separately standardizes both the bus travel times and the vehicle travel times according to the total data in the entire period based on either:

  • the five minute time frame. For example, a bus/vehicle travel time on 2022-03-01 between 7am and 7:05am would be standardized against all the other travel times of the same type that occur between 7am and 7:05am in the period

  • the hour of travel. For example, a bus/vehicle travel time that occurs on 2022-03-01 between 7am and 8am would be standardized against all the other travel times of the same type in the period that occur between 7am and 8am

  • the rush hour of travel. For example, a bus/vehicle travel time that occurs on 2022-03-01 during the morning rush hour would be standardized against all the other travel times of the same type in the period that occur during the morning rush hour

These options are provided to the function as an argument (time, hour, rush). As the time frame widens, more data is available for standardization, but the standardization takes a wider time range, leading to increased bias. This is why in addition to standardizing the data, the standardiser function also stores the total number of data points present in each time frame according to the method chosen. The function also removes observations greater than three standard deviations away as these are considered outliers that can affect the analysis.

Ideally, the travel times would be standardized according to the same five minute time frame across the entire period, as this would provide the highest accuracy. However, as the bus trips per five minute time frame were averaged into one five minute travel time, and we are analyzing only one month of data containing 21 working days, there are not enough travel times to accomplish this, since there would be a maximum of 21 data points per five minute time frame available for standardization. Instead, the default standardization parameter is by hour, which provides a much greater number of data points at the cost of some bias.
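The behaviour described for the standardiser function can be sketched as follows. This is not the report's implementation: the name standardise_sketch, its signature, the grouping by direction, the use of the sample standard deviation, and the choice to drop outliers after computing the standardized value are all assumptions made for illustration.

```python
from statistics import mean, stdev

def standardise_sketch(rows, value, by="hour"):
    """Standardise `value` within each (direction, `by`) group, where `by`
    is "time", "hour", or "rush". Each returned row gains `<value>_z` and
    `n`, the number of data points in its group. Rows more than three
    standard deviations from the group mean are dropped."""
    groups = {}
    for row in rows:
        groups.setdefault((row["direction"], row[by]), []).append(row[value])
    result = []
    for row in rows:
        values = groups[(row["direction"], row[by])]
        if len(values) < 2:
            continue  # a standard deviation needs at least two points
        spread = stdev(values)
        if spread == 0:
            continue  # identical values cannot be standardised
        z = (row[value] - mean(values)) / spread
        if abs(z) <= 3:
            result.append({**row, value + "_z": z, "n": len(values)})
    return result

# Hypothetical bus travel times for the 7am hour on three different days.
rows = [
    {"day": "2022-03-01", "hour": 7, "rush": "morning", "direction": "NB", "bus_time": 600},
    {"day": "2022-03-02", "hour": 7, "rush": "morning", "direction": "NB", "bus_time": 660},
    {"day": "2022-03-03", "hour": 7, "rush": "morning", "direction": "NB", "bus_time": 720},
]
standardised = standardise_sketch(rows, "bus_time", by="hour")
z_scores = [round(row["bus_time_z"], 6) for row in standardised]
```

Storing n alongside each standardized value makes it possible to see how many data points support each group when a narrower time frame is chosen.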

With the bus and vehicle travel times standardized, we can now examine the relationship between the travel times with respect to variation. If the vehicle travel time deviates from the average relative to the time of day, do we observe a similar deviation by the bus travel time?

Plots of the standardized travel times are shown below:


Figure 5.6: Northbound morning travel time variations are similar


Figure 5.7: Southbound evening travel time variations are similar


Figure 5.6 and figure 5.7 show that variations in vehicle travel times are in fact closely matched by variations in bus travel times.

The distribution of the standardized travel times is shown below:


Figure 5.8: Greater variation is present in the northbound morning travel times than southbound evening travel times


Figure 5.8 shows that travel times within the same time period have relatively greater variation for both types in the morning towards the city, while the travel times within the same time period in the evening away from the city show much less variation, especially vehicles. The variations here are not to be confused with the variations across the entire rush time across the entire period as shown in figure 5.1. As a reminder, the standardization in the plot above is performed relative to travel times in the same hour across the entire period, and is therefore more specific.


Figure 5.9: A positive relationship exists between both travel times for both directions during peak times


Figure 5.9 shows that the standardized travel times are particularly correlated in the evening southbound away from the city.

The correlation figures between the standardized travel times are:

Table 5.2: Correlation between standardized travel times per direction per rush hour
Rush      Direction   Correlation
Morning   NB          0.65
Morning   SB          0.23
Evening   NB          0.43
Evening   SB          0.74

Table 5.2 shows that strong correlation exists between the standardized travel times during the evening southbound away from the city.


6 Conclusion, Findings, and Future Directions

The report examined the relationship between the bus travel times and motor vehicle travel times, and the robustness of the bus network to road congestion, particularly during rush hour periods. A major takeaway from the project was the recognition of the extensive workload incurred in cleaning and wrangling the data, a common theme in many data analysis projects, as well as in creating the methodology to perform the analysis. A couple of limitations exist in the data. The first is that the trip updates dataset provides the predicted arrival times at the bus stops, whereas the actual arrival times would provide greater accuracy. The second is that the period of the analysis is constrained to one month due to file size limitations, whereas a longer time frame would provide greater certainty in the results.

An additional objective of the project was to create the analysis code in such a way as to allow reproducibility with minimal manual input across different road segments and time periods. In fact, the analysis was also conducted on an additional road in Adelaide (Marion Road) to enable comparison with South Road and to highlight any possible enhancements to the code. The result of that analysis was not included in this report due to report length limitations; however, a modified report that includes and compares both roads was created and provided to the client (found in the GitHub repository).

The results from the analysis on South Road show the following:

  • The absolute travel times in the evening rush hour away from the city are much greater and more varied than those in the morning towards the city

  • The standardization of the travel times with respect to the time of day indicates that while evening travel times from the city are greater in absolute terms, they are consistently so, whereas the morning travel times towards the city vary over a wider range while being shorter in absolute terms

  • The evening southbound correlation between the standardized travel times is strong; however, the variations are relatively small, so it is difficult to infer the robustness of bus transportation to congestion, as larger variations would have to be observed and examined

  • The morning standardized travel times towards the city are more varied and the correlation is lower, implying the bus transportation network is relatively more robust to congestion than the evening bus travel from the city

Possible future directions stemming from this analysis include:

  • performing the analysis on more roads to create a ranking of bus network robustness to road congestion. This can result in a prioritization list of resource allocation to enhance the bus network

  • conducting the analysis on previous periods to provide an indication of the rate of change of bus network resilience to congestion

  • identifying the factors that can affect bus travel times

  • creating a predictive model for bus travel times using the identified features


7 Appendix

7.1 Bus Trips Updates Errors

  1. All stops for the trip have similar delays that are either very large or very negative, meaning that the entire trip is very delayed or very early. This is most likely due to errors when retrospectively entering the information at a later time. Based on figure 4.3, the thresholds were set at a delay of 2,400 seconds (40 minutes) and an early arrival of 900 seconds (15 minutes). These trips were removed because they would fall in the wrong time frame when analyzed against the vehicle travel time

  2. A stop has a sudden large predicted delay resulting in a much later arrival time than that of the following stop. These stops were removed

  3. A stop has an arrival time later than any following stops and the timestamp is earlier than any following stops. These stops were removed to ensure the most recent timestamp is preferred when a discrepancy occurs

  4. A stop’s arrival time is earlier than previous stops and the timestamp is older

  5. A stop’s arrival time is earlier than prior stops but they both have the same timestamp. In this case it is not possible to know which is correct. We assume the stop with the earlier stop sequence is correct since it is closer to the bus when the update is made

  6. Two consecutive stops have identical arrival times but with different timestamps. The stop with the older timestamp was removed

  7. Two consecutive stops have identical arrival times and timestamps. The stop with the higher stop sequence was removed

  8. Many stops on the same trip have the same arrival time, likely due to a retrospective entry error. These trips were removed
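As an illustration of the first error type in the list above, a trip whose stops are all beyond the late or early threshold could be flagged as sketched below. The function name and its parameters are hypothetical; only the two thresholds come from the description of that error.

```python
def is_whole_trip_shifted(delays, late_threshold=2400, early_threshold=900):
    """Return True when every stop delay (in seconds) on the trip is later
    than the late threshold, or every one is earlier than the early
    threshold. Early arrivals are negative delays."""
    if not delays:
        return False
    all_late = all(delay > late_threshold for delay in delays)
    all_early = all(delay < -early_threshold for delay in delays)
    return all_late or all_early

# Hypothetical delay sequences, one list per trip.
very_late_trip = [2500, 2600, 2550]
very_early_trip = [-1000, -950]
single_spike_trip = [30, 2500, 40]  # one bad stop only, not a shifted trip

flags = [
    is_whole_trip_shifted(very_late_trip),
    is_whole_trip_shifted(very_early_trip),
    is_whole_trip_shifted(single_spike_trip),
]
```

A trip with a single extreme stop is deliberately not flagged here, since that case corresponds to a different error type in the list.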


8 References